
Conversation

@a-r-r-o-w
Contributor

Fixes #11307

The previous implementation assumed that layers were instantiated in the same order as they are invoked in the forward pass. This is not true for HiDream (the caption projection layers are instantiated after the transformer layers).

The new implementation first captures the invocation order and then applies group offloading. When `use_stream=True`, it does not really make sense to onload more than one block at a time, so we also now raise an error if `num_blocks_per_group != 1` when `use_stream=True`.
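
A minimal sketch of the idea (hypothetical helper; not the actual diffusers implementation): record the order in which the top-level blocks are invoked during a forward pass, and only then form the offload groups from that order instead of the instantiation order.

```python
import torch
import torch.nn as nn


def record_invocation_order(model: nn.Module, example_inputs: dict) -> list[str]:
    """Hypothetical helper: run one forward pass with pre-forward hooks on the
    immediate children of `model` and record the order in which they are called.
    Only a sketch of the idea; the actual diffusers implementation differs."""
    order: list[str] = []
    handles = []

    def make_hook(name):
        def hook(module, args, kwargs):
            if name not in order:
                order.append(name)
        return hook

    for name, child in model.named_children():
        handles.append(child.register_forward_pre_hook(make_hook(name), with_kwargs=True))

    with torch.no_grad():
        model(**example_inputs)

    for handle in handles:
        handle.remove()

    # Offload groups can then be formed by chunking `order` into consecutive groups of
    # num_blocks_per_group entries, instead of relying on instantiation order
    # (which does not match invocation order for HiDream).
    return order
```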

Another possible fix is to simply move the initialization of the caption layers above the transformer blocks.

@sayakpaul @asomoza Could you verify if this fixes it for you?

@a-r-r-o-w a-r-r-o-w requested review from DN6, asomoza and sayakpaul April 21, 2025 10:39
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@sayakpaul sayakpaul left a comment


LGTM! Thank you.

@sayakpaul
Member

sayakpaul commented Apr 21, 2025

I did some testing and we get the following numbers:

No record_stream
=== System Memory Stats (Before encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1942.53 GB

=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (After encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1932.83 GB

=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB

=== System Memory Stats (Before transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1917.84 GB

=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB

=== System Memory Stats (After loading transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1880.56 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:30<00:00,  4.20s/it]
latents.shape=torch.Size([1, 16, 128, 128])

=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 5.68 GB
Max reserved: 5.68 GB

record_stream
=== System Memory Stats (start) ===
Total system memory:    1999.99 GB
Available system memory:1941.94 GB

=== CUDA Memory Stats start ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (Before encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1940.32 GB

=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (After encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1930.62 GB

=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB

=== System Memory Stats (Before transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1915.65 GB

=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB

=== System Memory Stats (After loading transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1883.74 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:14<00:00,  3.89s/it]
latents.shape=torch.Size([1, 16, 128, 128])

=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 4.30 GB
Max reserved: 4.30 GB
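
For reference, the `record_stream` run peaks at 4.30 GB reserved vs. 5.68 GB without it, and runs at 3.89 s/it vs. 4.20 s/it. A sketch of how such a comparison might be configured (the checkpoint id, model class, and keyword names below are assumptions, not the script used for these numbers; check the group offloading docs of the installed diffusers version):

```python
import torch
from diffusers import HiDreamImageTransformer2DModel
from diffusers.hooks import apply_group_offloading

# Sketch only: checkpoint id, subfolder, and argument names are assumptions.
transformer = HiDreamImageTransformer2DModel.from_pretrained(
    "HiDream-ai/HiDream-I1-Full", subfolder="transformer", torch_dtype=torch.bfloat16
)
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,   # must be 1 when use_stream=True after this PR
    use_stream=True,
    record_stream=True,       # set to False to reproduce the "No record_stream" run
)
```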

diffusers-cli env:

- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.8.0.dev20250417+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0.dev0
- PEFT version: 0.15.2.dev0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA H100 80GB HBM3, 81559 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Member

@sayakpaul sayakpaul left a comment


Thanks for adding the test! Just two comments.

@a-r-r-o-w
Contributor Author

Failing tests seem unrelated

@a-r-r-o-w a-r-r-o-w merged commit 6cef71d into main Apr 23, 2025
15 of 16 checks passed
@a-r-r-o-w a-r-r-o-w deleted the fix-block-level-stream-offloading branch April 23, 2025 12:47
option only matters when using streamed CPU offloading (i.e. `use_stream=True`). This can be useful when
the CPU memory is a bottleneck but may counteract the benefits of using streams.
"""
if stream is not None and num_blocks_per_group != 1:
Collaborator


This is potentially breaking, no? What if there is existing code with `num_blocks_per_group > 1` and `stream=True`? If so, it might be better to raise a warning and set `num_blocks_per_group` to 1 if `stream` is True.

Contributor Author


Has been addressed in #11425
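
For context, a self-contained sketch of the warn-and-clamp behavior suggested above (illustrative only; the helper name is hypothetical and the actual change is in #11425):

```python
import warnings


def _resolve_num_blocks_per_group(num_blocks_per_group: int, use_stream: bool) -> int:
    # Hypothetical helper: instead of raising, warn and clamp num_blocks_per_group
    # to 1 when streams are used, so existing code keeps working.
    if use_stream and num_blocks_per_group != 1:
        warnings.warn(
            f"Using streams is only supported with num_blocks_per_group=1, but got "
            f"{num_blocks_per_group}. Setting num_blocks_per_group to 1."
        )
        return 1
    return num_blocks_per_group
```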

Development

Successfully merging this pull request may close: HiDream running into issues with group offloading at the block-level